Probability Smoothing

Author

  • Djoerd Hiemstra
Abstract

Smoothing overcomes the so-called sparse data problem: many events that are plausible in reality are not found in the data used to estimate probabilities. When maximum likelihood estimates are used, unseen events are assigned zero probability. In information retrieval, most events are unseen in the data, even if simple unigram language models are used (see N-GRAM MODELS): documents are relatively short (say, several hundred words on average), whereas the vocabulary is typically large (perhaps millions of words), so the vast majority of words do not occur in a given document. A short document about “information retrieval” might not mention the word “search”, but that does not mean it is irrelevant to the query “text search”. The sparse data problem is the reason it is hard for information retrieval systems to obtain high recall without degrading precision, and smoothing is a means to increase recall (possibly degrading precision in the process). Many approaches to smoothing have been proposed in the field of automatic speech recognition [1]. A smoothing method may be as simple as so-called Laplace smoothing, which adds an extra count to every possible word. The following equations show, respectively, (8) the unsmoothed, or maximum likelihood, estimate, (9) Laplace smoothing, (10) linear interpolation smoothing, and (11) Dirichlet smoothing [3]:
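
Equations (8) to (11) are not reproduced in this excerpt. As a rough illustration of the four estimators named above, the sketch below uses their common textbook forms, which may differ in notation from the article's own equations. The function names, the toy document and collection, and the parameter values `lam=0.5` and `mu=2000.0` are all assumptions made for the example, as is the choice to put the weight `lam` on the document model.

```python
from collections import Counter


def p_ml(tf, doc_len):
    """Unsmoothed maximum likelihood estimate: relative frequency in the document."""
    return tf / doc_len


def p_laplace(tf, doc_len, vocab_size):
    """Laplace smoothing: one extra count for every word in the vocabulary."""
    return (tf + 1) / (doc_len + vocab_size)


def p_interpolated(tf, doc_len, p_coll, lam=0.5):
    """Linear interpolation of the document model with a collection model.

    lam is a hypothetical mixing weight; here it weights the document model.
    """
    return lam * tf / doc_len + (1 - lam) * p_coll


def p_dirichlet(tf, doc_len, p_coll, mu=2000.0):
    """Dirichlet smoothing: mu pseudo-counts spread according to p_coll."""
    return (tf + mu * p_coll) / (doc_len + mu)


# Toy data, invented for illustration only.
doc = "information retrieval systems rank documents".split()
collection = doc + "text search engines rank documents for a query".split()

doc_tf = Counter(doc)          # term frequencies in the document
coll_tf = Counter(collection)  # term frequencies in the whole collection
vocab = sorted(coll_tf)        # vocabulary = all distinct collection terms


def estimates(term):
    """Return all four estimates of P(term | document) for the toy data."""
    tf = doc_tf[term]  # Counter yields 0 for terms absent from the document
    p_coll = coll_tf[term] / len(collection)
    return {
        "ml": p_ml(tf, len(doc)),
        "laplace": p_laplace(tf, len(doc), len(vocab)),
        "interpolated": p_interpolated(tf, len(doc), p_coll),
        "dirichlet": p_dirichlet(tf, len(doc), p_coll),
    }


if __name__ == "__main__":
    # "search" does not occur in the document but does occur in the collection.
    for name, p in estimates("search").items():
        print(f"{name:>13}: {p:.4f}")
```

For the unseen term "search", the maximum likelihood estimate is zero while the three smoothed estimates are positive, which is the effect the abstract describes. Each estimator still sums to one over the toy vocabulary, so probability mass is moved from seen to unseen words without being created.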

Similar resources

Two-step Smoothing Estimation of the Time-variant Parameter with Application to Temperature Data

In this article, we develop two nonparametric smoothing estimators for the parameter of a time-variant parametric model. This parameter can be from any parametric family or from any parametric or semi-parametric regression model. Estimation is based on a two-step procedure, in which we first get the raw estimate of the parameter at a set of disjoint time...

Generalized Probability Smoothing

In this work we consider a generalized version of Probability Smoothing, the core elementary model for sequential prediction in the state-of-the-art PAQ family of data compression algorithms. Our main contribution is a code length analysis that considers the redundancy of Probability Smoothing with respect to a Piecewise Stationary Source. The analysis holds for a finite alphabet and expresses ...

A d-step Fixed-lag Smoothing Algorithm for Markovian Switching Systems

Research supported in part by the National Natural Science Foundation of China. A suboptimal approach to the d-step (d ≥ 0) fixed-lag smoothing problem for Markovian switching systems is presented. Multiple model estimation techniques have been widely used in solving state estimation problems for these systems. We demonstrated that the mode probability of each fixed-lag smoother at time k-...

Near-Optimal Smoothing of Structured Conditional Probability Matrices

Utilizing the structure of a probabilistic model can significantly increase its learning speed. Motivated by several recent applications, in particular bigram models in language processing, we consider learning low-rank conditional probability matrices under expected KL-risk. This choice makes smoothing, that is, the careful handling of low-probability elements, paramount. We derive an iterative...

Behavior near zero of the distribution of GCV smoothing parameter estimates

It has been noticed by several authors that there is a small but non-zero probability that the GCV estimate of the smoothing parameter in spline and related smoothing problems will be extremely small, leading to gross undersmoothing. We obtain an upper bound to the probability that the GCV function, whose minimizer provides this estimate, has a (possibly local) minimum at 0. This upper bound goes to 0 e...

An Approach to Improve the Smoothing Process Based on Non-uniform Redistribution

In this paper, we propose an effective technique for improving the smoothing method in language models, based on a non-uniform redistribution of probability to novel (unknown) events. Basically, there are two processes in smoothing methods: 1) discounting and 2) redistributing. Instead of the uniform probability assignment to each unseen event used by most smoothing methods, we propos...

Publication date: 2009